The CUDA execution model transforms your computer into a high-performance heterogeneous system. Imagine a Grand Director (the Host/CPU) and an Army of Thousands (the Device/GPU). The Director handles complex logic and decision-making, while the Army performs massive, repetitive tasks simultaneously.
1. The Architectural Divide
The Host is a latency-optimized CPU designed for complex control flow and serial tasks. By contrast, the Device is a throughput-optimized GPU containing thousands of simple cores designed to execute the same instruction across vast datasets simultaneously.
2. The Execution Rhythm
A CUDA program alternates between phases. Execution begins on the Host with serial code. When the program reaches a parallel kernel launch, it dispatches a Grid of threads onto the Device. Conceptually, control returns to the Host once the Device finishes its massive workload; in practice, kernel launches are asynchronous, so the Host continues immediately and must synchronize explicitly before consuming the results.
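This rhythm can be sketched in a minimal program. The kernel name `doubleElements` and the doubling workload are illustrative choices, not part of any standard API; the phase structure (serial host setup, grid launch, synchronization, serial host wrap-up) is the point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: each thread doubles one element of the array.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // unified memory, visible to host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // serial host phase

    int threads = 256;
    int blocks = (n + threads - 1) / threads;      // enough blocks to cover all n elements
    doubleElements<<<blocks, threads>>>(data, n);  // parallel kernel phase: launch a grid
    cudaDeviceSynchronize();                       // wait for the device before the host reads results

    printf("data[0] = %f\n", data[0]);             // serial host phase resumes
    cudaFree(data);
    return 0;
}
```

Compiled with `nvcc`, this follows the serial-parallel-serial pattern described above: the host owns everything before the `<<<blocks, threads>>>` launch and everything after the synchronize call.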
3. Performance Specialization
The model leverages the strengths of both: the CPU manages system resources and complex, branch-heavy control flow, while the GPU executes SPMD (Single-Program, Multiple-Data) logic, applying one program across many data elements in parallel.
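SPMD is easiest to see in a classic example kernel such as SAXPY (y = a*x + y): every thread runs the identical program, but its computed global index steers it to a different data element. This is an illustrative sketch, not a library routine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SPMD in practice: one program, many data elements.
// Each thread's global index selects the element it owns.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
}

int main() {
    const int n = 4096;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));

    // Host (the Director) handles setup and decides what to run.
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Device (the Army) applies the same instruction stream to every element.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2.0f * 1.0f + 2.0f = 4.0f
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

The division of labor mirrors the section above: the branchy decisions (problem size, launch configuration, what to do with the results) stay on the CPU, while the uniform per-element arithmetic runs on the GPU.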